ABSTRACT

A problem which has recently attracted research attention is 
that of estimating the distribution of flow sizes in internet 
traffic. On high traffic links it is sometimes impossible to 
record every packet. Researchers have approached the prob- 
lem of estimating flow lengths from sampled packet data in 
two separate ways. Firstly, different sampling methodologies
can be tried to more accurately measure the desired system
parameters. One such method is the sample-and-hold
method where, if a packet is sampled, all subsequent packets 
in that flow are sampled. Secondly, statistical methods can 
be used to "invert" the sampled data and produce an estimate 
of flow lengths from a sample. 

In this paper we propose, implement and test two vari- 
ants on the sample-and-hold method. In addition we show 
how the sample-and-hold method can be inverted to get an 
estimation of the genuine distribution of flow sizes. Exper- 
iments are carried out on real network traces to compare 
standard packet sampling with three variants of sample-and- 
hold. The methods are compared for their ability to recon- 
struct the genuine distribution of flow sizes in the traffic. 

Categories and Subject Descriptors 

G.3 [Probability and Statistics]: Distribution func- 
tions; C.2.3 [Computer-Communication Networks]: 

Network Operations — Network monitoring 

General Terms 

Statistical Inversion, Measurement 

Keywords 

Sampling, Inference, Inversion 

1. INTRODUCTION 

Routers at the core of the internet deal with millions 
of packets per second on multiple interfaces. From a 
network operations perspective, it is vital for the ad- 
ministrators to be aware of the volume and types of 
the packets that are traversing their networks. In order 
to achieve this objective, routers are required to collect 
management information but it is impossible to keep a
record of all the packets. Thus, given the vast amount of
information that needs to be collected, routers sample 
the traffic stream. This means that only a subset of the 
packets traversing any interface of the router are pro- 
cessed. Today, the most commonly implemented tech- 
nique is packet sampling, where 1 packet out of every N 
is chosen on a random or periodic basis, and integrated 
into a flow record in the router memory. 

In many practical cases, packet sampling is fol- 
lowed by multiplication of the recovered statistics by N 
(N-multiplication). This simple technique can be used 
to recover a number of packet level statistics of inter- 
est. For example, the number of SYN packets, TCP 
packets, ICMP packets or packets to or from given des- 
tinations in the original trace can be estimated by this 
process. However, the distribution of flow lengths cannot
be recovered by this procedure (see Section 1.3).
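
As a rough illustration of N-multiplication, the sketch below scales packet-level counts from a 1-in-N sample back up by N. The function name `n_multiply` and the example counts are hypothetical and are not taken from any router implementation.

```python
def n_multiply(sampled_counts, n):
    """Scale packet-level counts from a 1-in-n sample back up by n."""
    return {kind: count * n for kind, count in sampled_counts.items()}


# Hypothetical counts observed in a 1-in-100 packet sample.
sampled = {"SYN": 12, "ICMP": 3}
estimate = n_multiply(sampled, 100)
```

This works for per-packet counters because each packet is an independent unit; a flow length is a property of a group of packets, which is why the same scaling does not recover the flow length distribution.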

The problem at the heart of this paper is that of re- 
covering the distribution of flow lengths from sampled 
data. The flow inversion problem amounts to math- 
ematically compensating for the effects of sampling in 
order to estimate the distribution of flow lengths which 
would have been observed in the original data. There 
has been great research activity around flow distribution
inversion [1] [2] [3], and this is discussed in
Section 1.5. Based on previous studies analysing
NetFlow performance [4], it is evident that
N-multiplication upsets the flow level statistics of the 
original data stream [5]. 

1.1 Outline 

This paper focuses on considering sampling methods 
and their use to estimate the flow length distribution in 
real traffic. Packet sampling has many useful statistical
properties, but it is a hard problem to recover the flow 
distribution from packet-sampled data. This is some- 
times called the flow inversion problem. In this paper 
we investigate three techniques based on the idea of the 
sample-and-hold method [6]. This method was originally
conceived to track the largest flows in the traffic
(with applications related to billing) [?].

Section 1.3 gives some basic information about packet
sampling as applied in real situations. Section 1.5
discusses other work on the problem of inferring flow
distributions from sampled data. In Section 2 we describe the
sampling methods we use and the inversion procedure
used to recover the original flow distribution. There
is also a brief overview of the router memory and
resource requirements. In Section 3 we apply our
proposed algorithms to packet traces from a backbone
network and examine their performance. In Section 4
we summarize our results and discuss potential avenues
for future work.

1.2 Definitions 

The traffic on the Internet is carried in the form of Internet
Protocol (IP) packets and transmitted to the destination
on a hop-by-hop basis by Internet routers. In
order to keep account of the packets belonging to the 
same application, the concept of a flow is defined by 
router manufacturers. A flow is usually defined as a 
group of packets that have the same 5-tuple (IP pro- 
tocol, source address, source port, destination address, 
destination port). 

Usually, core Internet routers carry a large number 
of flows at any given time. This pressure on the router 
is controlled by using strict rules to remove from router 
memory (export) the statistics, and thus keep the router 
memory buffer and CPU resources available to deal with 
changes in traffic patterns by avoiding the handling of 
excessively large tables of flow records. Cisco NetFlow 
[8], the dominant standard on today's routers, uses the 
following criteria for expiring flows in the cache entries: 

1. Flows which have been idle for a specified time are
expired and removed from the cache (15 seconds 
is default). 

2. Long lived flows are expired and removed from the 
cache (30 minutes is default). 

3. As the cache becomes full a number of heuristics 
are applied to aggressively age groups of flows si- 
multaneously. 

4. TCP connections which have reached the end of 
byte stream (FIN) or which have been reset (RST) 
will be expired. 

As will be seen in Section 2.2, the selection of these
parameters can greatly affect the nature of the sampled 
traffic. 
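
Three of the four expiry criteria above can be sketched as a simple predicate. This is an illustrative sketch only, not NetFlow's actual implementation: the `FlowRecord` fields and the function name are hypothetical, and the cache-pressure heuristics of criterion 3 are left out because the text does not specify them.

```python
from dataclasses import dataclass

IDLE_TIMEOUT = 15.0          # seconds, the default quoted in the text
ACTIVE_TIMEOUT = 30 * 60.0   # seconds, the default quoted in the text


@dataclass
class FlowRecord:
    first_seen: float             # time of the first packet of the flow
    last_seen: float              # time of the most recent packet
    saw_fin_or_rst: bool = False  # TCP FIN or RST observed on the flow


def should_expire(flow, now):
    """Return True if the flow meets criterion 1, 2 or 4."""
    if now - flow.last_seen > IDLE_TIMEOUT:      # criterion 1: idle flow
        return True
    if now - flow.first_seen > ACTIVE_TIMEOUT:   # criterion 2: long lived flow
        return True
    return flow.saw_fin_or_rst                   # criterion 4: FIN or RST seen
```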

After the flow records are terminated, they are grouped
together and exported to an external aggregation point
through a UDP (User Datagram Protocol) stream. The
collection of these NetFlow records enables system
administrators to have a view of general trends in spatial
traffic distribution, network host behavior, traffic
matrix estimation, anomaly detection [5] and other
relevant measurements. However, the effectiveness of these
applications is contingent upon the quality of the flow
level statistics recovered from the actual network
measurements [10].

^The industry sometimes uses other definitions such as the
7-tuple format; however, we choose the 5-tuple format, which
is more commonly used in the research context.

The flow distribution is the distribution of flow lengths 
in a given traffic trace. The lengths are usually ex- 
pressed in packets but sometimes in bytes. This can be 
thought of as the probability that a given flow has a
particular length. That is, the distribution is
$(\theta_1, \ldots, \theta_m)$ where

$$\theta_i = P[\text{Randomly chosen flow is of length } i].$$

1.3 Packet sampling 

In an analysis by Cisco [11] one NetFlow-enabled ac- 
cess router used up to 68% of its CPU on processing 
flow records when an average of 65,000 flows was kept 
in memory. When sampling was used, this utilization 
was decreased by more than 82%. There are three
constraints on a core router which lead to the use of packet
sampling: the size of the record buffer, the CPU speed
and the flow record look-up time. In packet sampling,
in order to relax the pressure on the router while
collecting measurements, 1 in N packets is chosen, and
the rest are discarded. Sometimes this is done in a
periodic way with every Nth packet sampled. However, in
the literature, independent and identically distributed 
(iid) sampling with a fixed probability p is often con- 
sidered. The differences between periodic sampling and 
iid sampling can be important. Roughan [12] has shown 
iid sampling is useful in active probing and the concepts 
are also applicable to the case of passive measurement. 
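
The difference between the two schemes can be sketched as follows. This is a minimal illustration with hypothetical function names; real routers operate on a packet stream rather than a list.

```python
import random


def periodic_sample(packets, n):
    """Keep every n-th packet (deterministic 1-in-n sampling)."""
    return packets[n - 1::n]


def iid_sample(packets, p, rng):
    """Keep each packet independently with probability p."""
    return [pkt for pkt in packets if rng.random() < p]


stream = list(range(1, 11))
every_fifth = periodic_sample(stream, 5)
roughly_one_in_five = iid_sample(stream, 0.2, random.Random(1))
```

Periodic sampling always returns the same number of packets for a given stream length, while the size of an iid sample is itself random.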

There are many advantages to iid packet sampling 
and it preserves many important characteristics of the 
traffic. However, this sampling does not preserve the 
flow length distribution. The reason for this should be
clear but an example is illustrative. Imagine a situation
where the flow distribution is such that half the flows
in the original trace are of length two and half are of
length one ($\theta_1 = 0.5$ and $\theta_2 = 0.5$). Imagine these
packets are sampled in an iid manner with $p = 0.5$.
Half of the flows of length one will be sampled but only
one quarter of the flows of length two will have both
packets sampled. Another half of the flows of length two
will have just one packet sampled and a final quarter
will have no packets sampled. In the final sample the
flow distribution will be ($\theta'_1 = 0.8$ and $\theta'_2 = 0.2$). The
problem of flow inversion is, therefore, defined as the
problem of recovering the original distribution $(\theta_i)$ from
the sampled traffic.
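
The arithmetic of this example can be checked with a short calculation. The sketch below assumes iid packet sampling, so the number of sampled packets in a flow of a given length is binomial; the function name is hypothetical.

```python
from math import comb


def observed_distribution(theta, p):
    """theta maps flow length -> probability. Returns the distribution of
    sampled flow lengths, conditioned on at least one packet being sampled."""
    raw = {}
    for length, prob in theta.items():
        for k in range(1, length + 1):
            # Binomial probability that exactly k of `length` packets are kept.
            hit = comb(length, k) * p**k * (1 - p) ** (length - k)
            raw[k] = raw.get(k, 0.0) + prob * hit
    total = sum(raw.values())
    return {k: v / total for k, v in raw.items()}


sampled = observed_distribution({1: 0.5, 2: 0.5}, 0.5)
```

For the example in the text, `sampled` gives probability 0.8 to length one and 0.2 to length two, even though the original flows were evenly split.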

The choice of sampling strategy will have a large impact
on the quality of the data obtained from the network.
This is, to an extent, the reason why the
usability of NetFlow sampled data has been questioned
by researchers [13]. The problems with packet sampling
stem from the following effects it has on sampled
flows:

1. It is easy to miss short flows altogether. This is
due to the fact that many flows are only a few packets
long, and they may be temporally correlated.
Thus, these constituent packets may cluster together
and totally evade the sampling process.

2. It can be difficult to estimate the length of long
flows. The major problem is that, for each flow,
only a small subset of packets is seen, each with
probability p equal to the sampling rate. Thus,
it is not clear how many packets were actually
present when a given number $x_i$ have been
seen.

3. Flows may be mis-ranked. This means that, even
though flow A may seem to be larger than flow B
in the sampled statistics, this is not necessarily the
case in reality [14].

4. Large flows may be split into smaller ones (creating
sparse flows) [2]. This is due to the fact that some
long flows have a bursty nature, and thus may include
long periods of inactivity. During these periods,
they might be mistakenly expired, and any
new packets belonging to the same flow are
mistakenly classified as part of a new flow.

For applications such as billing and monitoring, the
naive inversion method of dividing the final statistics
by the sampling rate, that is, multiplying the final
data by (1/p), will simply lead to inaccurate results, as
pointed out in [5].

On the other hand, the most important problem of
the current NetFlow implementation of packet sampling
is the fact that many flows are not sampled at all,
as none of their packets are selected for sampling. In
many cases, particularly in short-lived flows (like web
and email applications, where a group of packets is
sent together to reply to a query), only one packet of the
flow is captured. This results in NetFlow reports being
dominated by single packet flows.

1.4 Practical implementation of sampling techniques

Even though sampling techniques are used
in order to simplify the processing of data collected at
a router, in practice sampling is a complicated process in
itself. In a simple implementation of packet sampling in
a router, there are various points to be considered. A
core or large access Internet router must constantly
accommodate memory and processor resource constraints.
Even though it is desirable to keep a large number of
records in the flow cache, the fast growth of this number
makes flow look-up and update a challenge. Thus, when
sampling is implemented, the operator has to decide on
a few parameters.

1. Sampling rate: The sampling rate has a direct im- 
pact on the quantity and the quality of the infor- 
mation formed from the data. 

2. Flow time-out: The length of the time-out can
have an impact on intermittent traffic flows, such
as peer-to-peer file sharing or Instant Messaging,
where the flows may not be transmitting packets
at full rate the whole time.

3. Flow expiry: If many large flows are active, the 
flow cache of the router becomes progressively full, 
leaving no space for new flows. In order to avoid 
this, a value must be chosen for the expiration 
(timeout) of the flows. 

4. Flow export frequency: If this is done too fre- 
quently, it increases the processing load on the 
router. However, if it is not done often enough, 
the loss of the UDP export packets in the path
can affect the quality of the gathered statistics.

5. Flow cache size: The number of flows which are 
kept at the router plays a critical role in its
performance. If this number is too large, flow look-up
time becomes a difficulty. On the other hand, if it
is too small, many flows are bound to be dropped 
and frequent expiry and time-out of flows will be 
needed. 

It can be observed that optimally setting all these
values can be a challenging task for an operator. Changes
in any of the above parameters can affect the length and
number of the flows which are reported by a router. In
Section 2.2 we look at some of the effects of the
mentioned parameters.

1.5 Related work 

Hohn and Veitch [3] considered in some depth the
problem of producing an estimate of flow distribution 
from sampled packets. They first looked at methods 
for "inversion" to recreate the original flow distribution 
from the sampled packet data. They use two schemes 
to recreate the flow distribution from the packet sampled
data, the first based upon a binomial sum and the
second upon a Cauchy integral. These schemes can suc- 
cessfully recover the flow distribution for short length 
flows if the sampling rate p is relatively high (for exam- 
ple, more than half the packets sampled is ideal). This 
is not a flaw in the methods described, but a fundamen- 
tal limitation in the amount of information which can 
be retrieved from packets sampled in this manner. 

Following this, their paper proposes a flow sampling 
model which can be used in an offline analysis of flow 
records formed from an unsampled packet stream. In
this method, all the packets are recorded and formed 
into flows. Then, a subset of these recreated flows are 
sampled using iid sampling with a given probability p. 
This sampling method proves extremely successful at 
recreating the flow distribution even when the sampling 
ratio p is relatively small (say p = 0.001). However, the 
intensive computing and memory requirements make
the implementation of such a scheme on high speed 
routers a challenge. 

Duffield et al. [2] have looked at recovering the flow
length distributions from a sampled packet trace. A
scaling-based Maximum Likelihood Estimation method
is proposed and, due to its complexity, an iterative Ex- 
pectation Maximization algorithm is tested on the avail- 
able trace files. The biggest issue encountered by the 
authors is the complexity of the process and the adjust- 
ment of the lowest order weights to reflect the under- 
lying distributions. It is re-established by the authors 
that the estimation of flow level statistics from packet 
sampled data remains an open question. 

In a subsequent paper [15], the authors introduce
threshold sampling as a sampling scheme that optimally 
controls the expected volume of samples and the vari- 
ance of estimators over any classification of flows. The 
proposed scheme has packet capturing performed at 
routers, followed by flow formation and export and stag- 
ing at a mediation station, and aggregation of records 
at a measurement collector. 

Ribeiro et al. [1] use several methods to estimate the
flow distribution from sampled packets. They make use
of several features of the TCP protocol, including the
SYN flag, and the fact that sequence numbers can give
information about the number of bytes between sampled
packets. Their work uses maximum likelihood
estimators to fit the distribution of flow lengths up to
some maximum flow length (maximum flow lengths of
twenty and one hundred are used in the paper). The
sequence numbers in particular prove helpful in extracting
information about these short and mid-length flows.
In addition they use a technique based on the
Cramér-Rao bound to investigate the best possible (lowest
variance) performance of unbiased flow distribution
estimators given assumptions about the information available.

Estan and Varghese [6] propose two algorithms for 
identifying the large flows: sample and hold and mul- 
tistage filters, which take a constant number of mem- 
ory references per packet and use a small amount of 
memory. If M is the available memory, the errors of 
the algorithms are proportional to 1/M; by contrast, 
the error of an algorithm based on classical sampling 
is proportional to $1/\sqrt{M}$, thus providing much less
accuracy for the same amount of memory. This scheme 
is intended for billing schemes where large flows are of 
higher interest to the operator. 



Estan et al [16] have proposed an improvement to 
NetFlow by adapting the sampling rate, enabling the 
router to keep a pre-determined number of flows in the 
cache. As a result of change in sampling rate, at each 
stage a normalization step is performed which ignores 
the packets that would not have been sampled if the 
lower sampling rates had been chosen. This scheme 
produces more concise but less accurate reports, due 
to reduction in the collected information. The con- 
stant change in sampling rate and renormalization stage 
can be an exploitable threat to the router, allowing the 
degradation of its performance under some
attack scenarios. 

Barakat et al. [14] study the possibility of detecting
and ranking the largest flows on a link. A comparison
is made between the blind ranking method and a second
ranking method. The
results indicate that at sampling rates of higher than 1
in 100 it is difficult to identify the top flow with either
method.

Although not strictly relevant to the inversion prob- 
lem it is worth noting that there is considerable research 
interest in the distribution of flow lengths in internet 
traffic. This is because the flow lengths are generally
held to be heavy-tailed [17], that is, they follow a
distribution such that

$$P[\text{Length of flow} > x] \sim x^{-\alpha},$$

where $\alpha \in (0, 2)$ and $\sim$ means asymptotically
proportional to as $x \to \infty$. This means that it is not sufficient
simply to look at the flows under a given length;
extremely long flows will also play an important part in
the make-up of the traffic.

2. METHODOLOGY 

Our results are based on a 30 minute long trace from 
an OC-48 link on the CAIDA [18] network on 24th April 
2003. The trace contains 47,047,240 packets, of which
on average 83% are TCP and 7% are UDP. The rest are
usually other network layer protocols, such as ICMP.

The sampling strategies used in this paper are re- 
ferred to as 

1. packet sampling,

2. sample-and-hold (by byte),

3. sample-and-hold (by packet) and

4. sample-and-hold (by SYN).

Sample-and-hold (by byte) is the original sample-and-hold
technique developed in [6]. Packet sampling, as has
been previously described, is the commonly used
technique of sampling each packet in an independent
manner with a given probability p. This can be contrasted
with techniques which are also commonly used whereby,
for a given n, every nth packet is sampled. The three
sample-and-hold techniques are described in the next
section.

2.1 Sample-and-hold techniques 

Sample-and-hold (by byte) is a sampling technique
developed in Estan-Varghese [6]. In this technique the
router keeps track of certain flows and samples every
packet on these flows until their expiry. The technique
was developed with the aim of producing a sampling
method in which flows that carry greater traffic volume
(sometimes called "elephant" flows) are more likely to
be sampled than smaller flows. Once a flow is expired,
due to one of the reasons discussed in Section 1.2, it is
marked for export and kept in the router cache, until
a relatively large set of flows is ready for export to an
external aggregation point. The process proceeds, on a
packet-by-packet basis, as follows. When a packet is
seen which is part of a flow being tracked, that packet
is sampled. If the packet is part of a flow which is
not being tracked, then there is a probability that this
packet will be sampled and the flow will be added to the
list of flows being tracked. Let $b$ be the length of the
packet being considered in bytes. Let $p$ be a constant
in (0, 1). The probability of starting to sample this flow
at the packet under consideration is $p_b = 1 - (1 - p)^b$.
This is equivalent to considering sampling every byte
with probability $p$.

Sample-and-hold (by packet) is an obvious variant 
of this technique where the probability of beginning to 
sample a flow at a given packet is a constant p. This 
is equivalent to the technique from Estan-Varghese but 
with the probability fixed rather than depending on the 
length in bytes of the packet. 

Sample-and-hold (by SYN) is another sample-and-hold
variant based on the Transmission Control Protocol
(TCP). A valid TCP flow is expected to begin with
exactly one packet with the SYN flag set. If a packet 
is not part of the set of flows being tracked and it has 
the SYN flag set then, with probability p, that packet 
is sampled and that flow added to the list of flows being 
tracked. 
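
The three variants differ only in the probability with which an untracked flow starts being tracked at a given packet. A minimal sketch of that decision follows; the function name, the variant labels and the argument layout are hypothetical.

```python
import random


def start_tracking(variant, p, packet_bytes, syn_flag, rng):
    """Decide whether an untracked flow starts being tracked at this packet."""
    if variant == "byte":
        # Each byte is sampled independently with probability p.
        prob = 1.0 - (1.0 - p) ** packet_bytes
    elif variant == "packet":
        # Constant probability per packet, whatever its length.
        prob = p
    elif variant == "syn":
        # Only packets with the SYN flag set can start a tracked flow.
        prob = p if syn_flag else 0.0
    else:
        raise ValueError(f"unknown variant: {variant}")
    return rng.random() < prob
```

Once a flow is tracked, all three variants behave identically: every later packet of that flow is sampled until the flow expires.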

The idea is that this SYN based sampling is as close
as possible to a version of the flow-based sampling
suggested by Hohn-Veitch [3] which can be implemented
without recording every packet and producing flows from
them before sampling. Of course, in any given traffic
trace, some TCP flows will have sent their SYN packet before
the trace collection started. Other flows may have more
than one SYN packet. This was observed previously by
Duffield et al. [2]. In their packet traces, they
determined the proportion of those TCP flows containing at
least one SYN packet that contained exactly one SYN
packet. For one data set it was 98.8%; for another one
it was 94.6%.



In the CAIDA data investigated here, 7% of flows 
which contained one SYN packet contained at least one 
other. 

It should be noted that not all traffic in the traces
analyzed is TCP traffic, and the Sample-and-hold (by 
SYN) method can only produce an estimate of the dis- 
tribution of TCP flows. However, we have examined our 
algorithm on TCP since more than 90% of the traffic in 
our trace is TCP. 

2.2 Flow termination dilemma 

Memory constraints prevent routers from keeping flows 
active for long spans of time. The flow lifetime in the 
router cache is configurable by the user. If sampling is 
not used, it is impossible to keep the flows in the buffer 
for more than a few minutes on a heavily utilized router. 
For example, the authors in [10] use a flow expiry timeout
of 2 seconds, which they find to be the maximum before 
flow loss rates reach unacceptable levels. Figures [1] and 
[2] show the effects of the buffer size on the accuracy of 
the flow reports. It can be seen that the longer expiry 
times consistently help pick out more long flows. Ad- 
ditionally, it can be seen that the distribution obtained 
using a longer expiry time is more consistent with the 
straight line graph expected of a heavy-tailed flow
distribution, as discussed in Section 1.5.

Figure 1: Changes in the flow buffer can slice up
or join flows.

If the flow buffer memory in the router is chosen to
correspond to a timeout of length t, then each section of flow F2 will
be reported as a separate flow, and flow F1 will also
be sliced up into smaller flows, creating so-called sparse
flows. However, if the router can afford to have a larger
timeout for the flows (T in Figure 1), even if the smaller
flows of F2 are in reality individual, unrelated flows
(though this is very unlikely, due to the vast number of source
ports available for TCP packets at least), they are
reported as a single cumulative flow. Flow F1 will be
correctly reported as a complete flow.
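
The slicing effect can be sketched for the packets of a single 5-tuple. This is an illustrative sketch with hypothetical timestamps, in which a gap longer than the timeout ends the current flow and starts a new one.

```python
def split_into_flows(timestamps, timeout):
    """Group the sorted packet times of one 5-tuple into reported flows."""
    flows = []
    for t in timestamps:
        if flows and t - flows[-1][-1] <= timeout:
            flows[-1].append(t)   # gap is short enough: same flow
        else:
            flows.append([t])     # first packet, or idle gap exceeded
    return flows


# Hypothetical bursty flow: three bursts separated by long idle periods.
bursty = [0, 1, 2, 30, 31, 32, 60, 61]
short_timeout = split_into_flows(bursty, 5)
long_timeout = split_into_flows(bursty, 40)
```

With the short timeout the bursty flow is reported as three sparse flows; with the long timeout it is reported as one.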

Figure 3 displays the complementary cumulative distribution
function (CCDF) of flow lengths on the CAIDA
data for two different flow expiry lengths. This shows 
clearly that, as would be expected, a shorter expiry 
time reduces the probability that the longer flows can 
be seen. 
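
An empirical CCDF of this kind can be computed directly from a list of flow lengths. The sketch below uses a hypothetical function name and a toy input, and is written for clarity rather than speed.

```python
def empirical_ccdf(flow_lengths):
    """Return sorted (x, P[length > x]) pairs for the observed flow lengths."""
    n = len(flow_lengths)
    points = []
    for x in sorted(set(flow_lengths)):
        greater = sum(1 for length in flow_lengths if length > x)
        points.append((x, greater / n))
    return points


ccdf = empirical_ccdf([1, 1, 2, 4])
```

Plotted on log-log axes, a heavy-tailed distribution appears as an approximately straight line in the tail of this curve.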
Figure 2: Number of observed flows of a given
length for two different timeout values on the
CAIDA data.

Figure 3: The complementary cumulative distribution
function for flow lengths on the CAIDA
data using two different flow expiry times.

2.3 Flow construction from trace files



To build flow records out of trace files, we emulated 
the operation of NetFlow on a general purpose com- 
puter, relaxing the real-time memory requirements usu- 
ally imposed on routers. Thus, we were able to greatly 
extend the amount of time that detailed flow records 
are kept in memory, and thus construct a baseline of 
unsampled measurements with which the results of our 
inversion procedures could be tested. However, the 
sheer amount of packets in some high-speed Internet 
core traces means that we cannot process all of it in one 
go. To address this, we divide time into analysis win- 
dows, over which flows are considered independently. 
For the 30 minute CAIDA datasets, however, we only 
used one analysis window per trace. 

The algorithm we used for the sample-and-hold
techniques of Section 1.4 is detailed in Algorithm 2.1, with
the variables explained in Table 1. The algorithm for
packet sampling is similar but simpler, since it does not 
need to track flows. 

Basically, a trace file is explored and each of its pack- 
ets considered in turn. If the packet belongs to a flow 
that had been previously selected for sampling, its in- 
formation is aggregated into the current flow tables. If 
it is not, its flow is sampled with a probability p. This 
probability is calculated on the basis of the sampling 
technique: in the case of sample-and-hold (by packet), 
the same p is used for all packets, while in sample-and- 
hold (by byte) the probability of sampling a packet is 
a function of the packet length. If a given packet is
selected, its 5-tuple $\phi$ is tracked, so that new packets with
this same $\phi$ and within the flow expiry timeout $t_t$ are
considered part of the same flow.

There are two conditions that trigger complete flow
buffer exports in Algorithm 2.1. The first one,
implemented by the boolean function FlowBufferFull(),
represents the flow buffer reaching its maximum
capacity. The other one, corresponding to
FlowExportTimerExpired(), represents the expiry of a flow
analysis window, that is, a periodic event over which the flow
collection process is restarted. When a buffer export
occurs (independently of its triggering condition) then
the system stops tracking all flows and writes the flow
statistics to disk. This means that, for every $\psi \in \Psi$, the
3-tuple $(\psi, T_p(\psi), T_b(\psi))$ is written to a file and the
relevant data structures in program memory are cleared.

It is informative to consider the difference between
the 5-tuple $\phi$ and the flow identifier $\psi$ in Algorithm 2.1.
If the buffer export timeout $t_w$ is significantly longer
than $t_t$, it may be possible to encounter two (or more)
different flows on the same 5-tuple during the same
analysis window (because the flow has been timed out
and exported then seen again). Thus, $\psi$ appends a
discriminating string to $\phi$, so that the identity of both
flows is maintained. As a result of this procedure,
after a flow expires its statistics are no longer updated,
and it will be analyzed as a separate entity from other
flows on the same $\phi$. Of course, by choosing $t_t > t_w$,
individual flow expiration can be completely bypassed,
and by setting $t_w$ larger than the time spanned by the
trace being analyzed, flow analysis window exports can
be completely bypassed as well. This gives Algorithm
2.1 full flexibility to explore the influence of $t_t$, $t_w$ and
$N_f$ on the sampled flow statistics and our proposed
inversion techniques.

^As Algorithm 2.1 was implemented for emulation, the size
of the flow buffer is not a hard parameter, but can be
modified as appropriate.

Algorithm 2.1: BuildFlows(trace)

while PacketsLeft(trace) do
    P ← ReadPacket(trace)
    (φ, t, N_b) ← DecodePacket(P)
    if FlowIsBeingTracked(φ) then
        comment: Has the flow expired?
        if (t_Δ > t_t) then
            ψ ← GetFlowID(φ)
            TerminateFlow(ψ)
            CreateFlow(φ)
        t_s(φ) ← t
        T_p(ψ) ← T_p(ψ) + 1
    else
        comment: Is the flow going to be sampled?
        if FlowSelectedForSampling(p, N_b) then
            t_s(φ) ← t
            ψ ← CreateFlow(φ)
            T_p(ψ) ← 1
    if FlowBufferFull(|Ψ|, N_f) or FlowExportTimerExpired(t, t_w) then
        ExportFlowBuffer()
        ResetFlowBuffer()
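
A minimal runnable sketch of the same loop is given below for the sample-and-hold (by packet) variant. It is an illustration under stated assumptions rather than the emulator used for the results: the packet tuple layout, the record layout, the byte counting and the final flush at the end of the trace are all assumptions made here.

```python
import random


def build_flows(packets, p, t_t, t_w, n_f, rng):
    """Emulate sample-and-hold (by packet) flow construction.

    packets: iterable of (five_tuple, time, n_bytes) in time order.
    Returns exported records as (flow_id, packet_count, byte_count).
    """
    tracked = {}   # five_tuple -> [flow_id, last_seen, packets, bytes]
    exported = []
    next_id = 0
    window_start = None

    def export_all():
        exported.extend((r[0], r[2], r[3]) for r in tracked.values())
        tracked.clear()

    for phi, t, n_b in packets:
        if window_start is None:
            window_start = t
        rec = tracked.get(phi)
        if rec is not None:
            if t - rec[1] > t_t:
                # The flow has expired: terminate it and start a new
                # flow identifier on the same 5-tuple.
                exported.append((rec[0], rec[2], rec[3]))
                rec = tracked[phi] = [next_id, t, 0, 0]
                next_id += 1
            rec[1] = t
            rec[2] += 1
            rec[3] += n_b
        elif rng.random() < p:
            # Untracked flow selected for sampling at this packet.
            tracked[phi] = [next_id, t, 1, n_b]
            next_id += 1
        if len(tracked) >= n_f or t - window_start >= t_w:
            export_all()          # buffer full or analysis window expired
            window_start = t
    export_all()  # assumption: flush whatever is left at the end of the trace
    return exported
```

Setting `p` to 1.0 tracks every flow from its first packet, which gives an unsampled baseline of the kind described in the text.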



For the rest of this paper we use a relatively large
timeout $t_t$ of five minutes for the flows. Even though
this may be longer than the value usually applied in
routers (of around fifteen to thirty seconds) it helps
avoid unnecessary flow splitting. In this paper we use
$t_t = t_w = 5$ minutes, so that all flow information is
reset every 5 minutes. After this is done, a second
post-processing step is performed where the output of this process
is integrated as a single 30 minute long file. This is done
to reduce memory consumption while avoiding dropping
flows due to lack of buffer space.

2.4 Inverting the sampled data 

Two methods for inverting packet sampled data are
given by Hohn and Veitch [3]. These techniques, while
mathematically sound, are problematic in realistic cases.
In particular, they are numerically unstable when
estimating longer flows or rates of sampling with small
values of p (where p is significantly less than 1/2). No
results are presented here for inverting packet sampling
but for an excellent discussion of the problem, the reader
is referred to [3].

Inverting the sample-and-hold (by packet or by byte) 
is, on the other hand, a new problem. 

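For each flow which is not being sampled, there is a
per-packet probability $p$ that the flow will start being
sampled at that point. Define $q = 1 - p$. Let $(\theta_1, \theta_2, \ldots)$
be the original flow length distribution before sampling,
where $\theta_i$ is the probability that a randomly chosen flow
is exactly $i$ packets long.

Let $\theta'_i$ be the probability that the algorithm would
start sampling a randomly chosen stream $i$ packets from
the end of the stream. This is the probability that $i$
packets are sampled. Note that $\theta'_0 \neq 0$.

$$\theta'_i = (1 - q) \sum_{j=i}^{\infty} q^{j-i} \theta_j \quad (i \geq 1), \qquad \theta'_0 = \sum_{j=1}^{\infty} q^j \theta_j.$$

Let $X_i$, $i \in \mathbb{N}$, be the distribution of flow lengths which
can actually be observed. This can be thought of as the
distribution $\theta'_i$ without the probability of zero length
flows.

$$X_i = P[\text{Sample length} = i \mid \text{Sample length} > 0] = \frac{\theta'_i}{\sum_{k=1}^{\infty} \theta'_k}.$$

The sum $\sum_{k=1}^{\infty} \theta'_k$ can be evaluated, giving

$$X_i = \frac{(1 - q) \sum_{j=i}^{\infty} q^{j-i} \theta_j}{1 - \sum_{j=1}^{\infty} q^j \theta_j}.$$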
Symbol         Meaning
$\phi$         5-tuple corresponding to packet P
$t$            Capture time of packet P
[illegible]    Number of bytes in packet P
[illegible]    Trace time since the last packet on 5-tuple $\phi$ was seen
$t_t$          Flow expiry timeout
$\psi$         Flow identification number
[illegible]    Set of all $\psi$
[illegible]    Total number of packets in flow $\psi$
[illegible]    Total number of bytes in flow $\psi$
$p$            Probability of starting to follow flow $\phi$
[illegible]    Flow buffer export timeout
[illegible]    Flow buffer size in records

Table 1: Variables for the flow construction algorithm




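Setting $i = 1$ and rearranging gives

$$\sum_{k=1}^{\infty} q^k \theta_k = \frac{q X_1}{1 - q + q X_1}.$$

Let $C = (1 - q)/(1 - \sum_{k=1}^{\infty} q^k \theta_k) = 1 - q + q X_1$ and
therefore

$$q^i X_i = C \sum_{j=i}^{\infty} q^j \theta_j.$$

Giving the final estimate

$$\hat{\theta}_i = \frac{X_i - q X_{i+1}}{C} = \frac{X_i - q X_{i+1}}{1 - q + q X_1}.$$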
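
As an illustration, the final estimate can be written as a few lines of Python. This is a sketch under stated assumptions rather than the implementation used for the experiments: the observed distribution is passed as a list whose first entry is the observed probability of a sampled length of one, the entry beyond the largest observed length is taken to be zero, and the function name is hypothetical.

```python
def invert_sample_and_hold(X, p):
    """Estimate the flow length distribution from sample-and-hold data.

    X[i - 1] is the observed probability that a sampled flow has length i
    (conditional on at least one packet being sampled), for i = 1..m.
    Returns estimates of P(flow length = i) for i = 1..m.
    """
    q = 1.0 - p
    C = 1.0 - q + q * X[0]     # normalization factor 1 - q + q * X_1
    padded = list(X) + [0.0]   # assume X_{m+1} = 0 beyond the data
    return [(padded[i] - q * padded[i + 1]) / C for i in range(len(X))]
```

When X is computed exactly from a known distribution the estimate reproduces that distribution; with X estimated from a finite sample, individual entries can be negative.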

This method has certain obvious weaknesses. The
factor $1 - q + qX_1$ is simply a normalization factor, so
the method relies wholly on the difference between $X_i$
and $X_{i+1}$. It is relatively insensitive to the particular
value of $p$ when $p$ is near zero (which it would be
for typical sampling rates) since the difference between
$X_i - 0.99X_{i+1}$ and $X_i - 0.999X_{i+1}$ is usually not great.
However, particularly for large flows, this creates problems.
In particular, if $qX_{i+1} > X_i$ (which, for $q$ close to one,
essentially means $X_{i+1} > X_i$) then the method will
produce a negative estimate for the probability. This
problem can be offset to some extent by pooling adjacent
estimates so that, instead of estimating the probability
that a flow has exactly length $i$, an estimate is given of
the probability that the flow has a
length in some range $i, i+1, \ldots, i+n$. This is discussed
in the next section.
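
A small worked example shows how a negative estimate arises. The numbers are hypothetical and chosen only for illustration: a sampling probability of 0.01 (so q = 0.99), an observed probability of 0.2 for a sampled length of one, and two adjacent observed probabilities where noise has made the longer length slightly more frequent than the shorter one.

```latex
q = 0.99, \quad X_1 = 0.2, \quad X_i = 0.0100, \quad X_{i+1} = 0.0110

\hat{\theta}_i
  = \frac{X_i - q X_{i+1}}{1 - q + q X_1}
  = \frac{0.0100 - 0.99 \times 0.0110}{0.01 + 0.99 \times 0.2}
  = \frac{-0.00089}{0.208}
  \approx -0.0043
```

The numerator is the difference of two nearly equal noisy quantities, so its sign is easily flipped by sampling noise at lengths where few flows are observed.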

We did not find any obvious method of inverting the
original Estan sample-and-hold (by byte). The method
used in this paper is simply to assume that the data was
obtained from sample-and-hold by packet, with $p$ taken as the
probability of sampling a packet of mean packet length
using Estan's method.
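
One way to make this approximation concrete is sketched below. It assumes that the byte based method selects each byte independently with some probability, written here as $p_b$, so that a packet triggers sampling if any of its bytes is selected. The symbols $p_b$ and $\bar{s}$ (mean packet length in bytes) and the numerical values are assumptions introduced for illustration.

```latex
p = 1 - (1 - p_b)^{\bar{s}} \approx \bar{s}\, p_b \quad \text{for } \bar{s}\, p_b \ll 1

\text{e.g. } p_b = 10^{-8}, \ \bar{s} = 500 \ \text{bytes}
  \;\Rightarrow\; p \approx 5 \times 10^{-6}
```

Because real packet lengths vary around the mean, the true per-packet probability differs from packet to packet, which is one reason the packet based inversion is only approximate for byte based data.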

For SYN based sampling, the assumption that each
TCP flow begins with exactly one SYN flag implies that
no inversion should be needed. Unfortunately, SYN
based sampling will only sample TCP flows and can
provide no information about the distribution of UDP
flows. This is a weakness of the method.

2.5 Logarithmic binning 

When examining the flow distribution, particularly
for long flows, it is likely to be of more interest to know
how many flows have a length in a given range, rather
than the number of flows with a specific length. Therefore,
we have used a pooling technique to average data
using a logarithmic scale. The data relating to flow
lengths is averaged over bins which contain data on a
set of flow lengths (for example, one bin for all lengths
from 1000 to 1100). The sizes of the bins are chosen so
that they have a constant width (or as nearly as possible
given they are integer valued) on a logarithmic scale.
This technique is sometimes known as logarithmic binning.

Logarithmic binning is a simple way of smoothing
sample data which is distributed on a log scale. Let $x_k$
be the number of observations of a flow with length
$k$. Let $m$ be the largest flow observed. The values
$x_1, x_2, \ldots, x_m$ will be combined into $n < m$ observations
$\bar{x}_1, \bar{x}_2, \ldots, \bar{x}_n$. Now, let $i_0, i_1, i_2, \ldots, i_n$ be some series
of integers such that $i_0 < i_1 < i_2 < \cdots < i_n$ with $i_0 = 1$,
$i_n$ is larger than the largest flow length observed and
$i_{k+1}/i_k$ is approximately constant for large $k$. Now we
can derive a series $\bar{x}_1, \bar{x}_2, \ldots, \bar{x}_n$ giving the average
number of observations in the range $[i_{k-1}, i_k)$; note
that the integer $i_k$ is not in this range (but will be in
the range of $\bar{x}_{k+1}$).

$$\bar{x}_k = \frac{\sum_{j=i_{k-1}}^{i_k - 1} x_j}{i_k - i_{k-1}},$$

for $k = 1, 2, \ldots, n$. Note that, for display purposes, it
makes sense to show the observation $\bar{x}_k$ as occurring in
the range $i_{k-1} - 0.5$ to $i_k - 0.5$.
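
The binning procedure can be sketched in Python as follows. The choice of bin edges (rounding a geometric sequence and forcing each edge to exceed the previous one by at least one) and the default ratio of 1.1 are assumptions made for this example; the paper does not state how its edges were chosen. Both function names are hypothetical.

```python
def log_bin_edges(max_length, ratio=1.1):
    """Integer bin edges starting at 1 and ending above max_length.

    Consecutive edges grow by roughly `ratio` once bins are wider than
    one, and always increase by at least one.
    """
    edges = [1]
    while edges[-1] <= max_length:
        edges.append(max(edges[-1] + 1, int(round(edges[-1] * ratio))))
    return edges


def log_bin(counts, edges):
    """Average the raw counts over each half-open bin [lo, hi).

    counts: dict mapping flow length -> number of flows of that length.
    Returns one average per bin.
    """
    binned = []
    for lo, hi in zip(edges[:-1], edges[1:]):
        total = sum(counts.get(j, 0) for j in range(lo, hi))
        binned.append(total / (hi - lo))
    return binned
```

Dividing by the bin width keeps the binned values comparable with the unbinned counts, so both can be drawn on the same axes.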

Figure [4] shows the results of logarithmic binning on
one of the data sets from Figure [2]. The technique
has two advantages for this study. Firstly, it produces
clearer information for large flows. In the large flows
regime, we usually observe either one flow or no flows,
and this can make the graph in that region harder
to interpret. However, the logarithmic binning allows
the graph to convey information about how many flows
are in a given range, including those long-flow regimes
where simple plots are usually uninformative. This also
gives a clearer idea of the heavy-tailed nature of flow
lengths. Secondly, it pools those estimators for which
there is most uncertainty. Estimating the probability
that there is a flow exactly (say) 10,005 packets long
is a difficult task, requiring vast amounts of data and
processing power. On the other hand, estimating the
expected number of flows which are between 10,000 and
11,000 packets long allows the pooling of estimators to
produce a more accurate estimate using a smaller data
set and with lower computational demands.



[Figure 4 plot. Series: "30 second flow expiry", "30 second flow expiry (log binning)". x-axis: Length of flow in packets (k).]

Figure 4: The effects of logarithmic binning.

[Figure 5 plot. Series: "30 second flow expiry (log binning)", "5 minute flow expiry (log binning)". x-axis: Length of flow in packets (k).]

Figure 5: Figure [2] replotted with logarithmic
binning.



3. RESULTS 



The results in this paper are all obtained on real 
network trace data. The traces are sampled using the 
techniques described in the previous section. The flow 
distribution is then produced on the sampled data af- 
ter sampling inversion techniques are applied (where 
such techniques are available) in order to recreate the 
original flow distribution. This is compared with the 
correct flow distribution obtained from the unsampled 
data. In order to assist comparison, the sampling rates 
are chosen so that approximately one packet in every 
one hundred is sampled. That is, the methods used are 
all sampling approximately the same percentage of the 
data set and the storage requirements for each sampling 
method would not be dissimilar. 

The logarithmic binning method is a simple but in- 
valuable tool for the investigation of sampled flow length 
distribution. In addition to being a useful method for 
presenting the data it enables the pooling of otherwise 
unreliable estimates to get a reliable estimate over a 
range of values. 

Packet sampling is an attractive sampling scheme for
many purposes. It allows recovery of many important
properties of the data; however, it is difficult to recover
flow based information. Three sample-and-hold based
schemes are used here, based upon the original sample- 
and-hold described by Estan and Varghese [Hj (which 
is here referred to as sample-and-hold by byte). Like 
packet sampling, sample-and-hold can be implemented 
in a practical setting (for example in firmware) [19].

3.1 Packet sampling 

As previously stated, inversion results are not given
here for packet based sampling. This is due to the
extreme difficulty of producing a flow distribution over the
full range of possible flow lengths from packet sampled
data (see the discussion in Section 2.4). The sampling
was performed to get one packet in one hundred by
setting $p = 0.01$. This gives 425,014 packets sampled in
207,126 flows, a mean of 2.1 packets per flow.

3.2 Sample-and-hold (by packet) 

The value of $p$ was adjusted so that approximately
one packet in every one hundred was sampled. The
value of $p$ used was 0.000014 and this gave 413,702
packets and 614 flows (a mean flow length of 674 packets
per flow).
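
The paper does not describe how $p$ was adjusted. One possible approach, sketched below in Python, uses the fact that under sample-and-hold by packet the k-th packet of a flow is recorded with probability 1 - (1-p)^k, and searches for the value of p that gives a target sampled fraction by bisection. It assumes the unsampled flow lengths are available, as they are for an offline trace; the function names are hypothetical.

```python
def expected_sampled_fraction(flow_lengths, p):
    """Expected fraction of packets recorded by sample-and-hold by packet."""
    q = 1.0 - p
    kept = 0.0
    for n in flow_lengths:
        # Sum over k = 1..n of (1 - q^k), written in closed form.
        kept += n - q * (1.0 - q ** n) / p
    return kept / sum(flow_lengths)


def tune_p(flow_lengths, target=0.01, iterations=60):
    """Bisection search for the p that gives the target sampled fraction."""
    lo, hi = 0.0, 1.0
    for _ in range(iterations):
        mid = (lo + hi) / 2.0
        if expected_sampled_fraction(flow_lengths, mid) < target:
            lo = mid   # too few packets kept: increase p
        else:
            hi = mid   # enough packets kept: decrease p
    return hi
```

Because long flows contribute most of the recorded packets, the resulting p is far smaller than the target fraction on heavy-tailed traffic, in line with the value of 0.000014 reported above.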

At a higher sampling rate the inversion algorithm can
be seen to be very good indeed. With $p = 0.001$,
10,333,134 of 47,047,240 packets were sampled on the
CAIDA trace. This is a very high rate of sampling but
suitable for an initial test of the sampling algorithm.

Figure [9] shows the density function for this experiment
before and after inversion compared with the unsampled
data. Figure [10] shows the distribution function
for the same experiment.

[Figure 6 plot. Series: "Packet sampling p = 0.01", "Unsampled data". x-axis: Length of flow in packets (k).]

Figure 6: The impact of packet sampling with
p = 0.01 on the flow distribution.

[Figure 7 plot. Series: "Samp-and-hold by packet (p = 0.000014)", "Samp-and-hold by packet (after inversion)", "Unsampled data". x-axis: length of flow (k).]

Figure 7: Cumulative density function for
packet based sample-and-hold with sampling of
approximately one packet in every one hundred
on the CAIDA data.

[Figure 8 plot. Series: "Samp-and-hold by packet (p = 0.000014)", "Samp-and-hold by packet (after inversion)", "Unsampled data". x-axis: length of flow (k).]

Figure 8: Density function for packet based
sample-and-hold with sampling of approximately
one packet in every one hundred on the
CAIDA data.

[Figure 9 plot. Series: "Samp-and-hold by packet (p = 0.001)", "Samp-and-hold by packet (after inversion)", "Unsampled data". x-axis: length of flow (k).]

Figure 9: Density function for packet based
sample-and-hold with p = 0.001 on the CAIDA
data.

[Figure 10 plot. Series: "Samp-and-hold by packet (p = 0.001)", "Samp-and-hold by packet (after inversion)", "Unsampled data". x-axis: length of flow (k).]

Figure 10: Cumulative density function for
packet based sample-and-hold with p = 0.001 on
the CAIDA data.

The sample-and-hold by packet method focuses on
large flows. At the high sampling rate in Figures [10] and
[9] the sample-and-hold by packet method inverts almost
precisely to the original distribution. However, roughly
one in five packets were sampled (about 22%) and this is an
unrealistically high sampling rate for a highly loaded router.
At the more realistic sampling rates shown in Figures [7]
and [8] the algorithm still performs relatively well,
particularly at higher flow lengths. The distribution here was
recovered from only 614 sampled flows. Another potentially
useful property of sample-and-hold by packet
is that the packets sampled can be resampled to create
a sample which would have been obtained from packet
sampling with probability $p$ (where this is the same $p$
used for sample-and-hold by packet in the first place).
This is done by keeping the first sampled packet in each flow
and then performing packet sampling with probability
$p$ on all subsequent packets.
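
The resampling step described above is short enough to sketch directly. In the following Python example each held flow is represented as a list of its recorded packets in arrival order, with the first entry being the packet that triggered sampling; this representation and the function name are assumptions made for illustration.

```python
import random


def resample_to_packet_sample(held_flows, p, rng=None):
    """Thin sample-and-hold data down to an ordinary packet sample.

    held_flows: list of lists; each inner list holds the packets recorded
    for one flow, starting with the packet that triggered sampling.
    p: the probability used for sample-and-hold by packet.
    """
    rng = rng or random.Random()
    out = []
    for packets in held_flows:
        kept = packets[:1]  # the triggering packet was already chosen with probability p
        kept += [pkt for pkt in packets[1:] if rng.random() < p]
        out.append(kept)
    return out
```

Passing a seeded `random.Random` instance as `rng` makes the thinning reproducible.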

3.3 Sample-and-hold (by byte) 

Figure [11] shows the results from sampling the CAIDA
data using the sample-and-hold (by byte) method as
proposed by Estan and Varghese [6]. The $p$ value has
been tuned so that approximately one in one hundred
packets are sampled. This gave 521,337 packets sampled
in total over 527 flows, a mean flow length of 989
packets per flow.




[Figure 11 plot. Series: "Samp-and-hold by byte (1 in 100 packets)", "Samp-and-hold by byte (after inversion)", "Unsampled data". x-axis: length of flow (k).]

Figure 11: Cumulative density function for byte
based sample-and-hold with sampling of approximately
one packet in every one hundred on the
CAIDA data.

Sample-and-hold by byte was engineered to focus on
the largest flows (sometimes referred to in the literature
as "elephant" flows). It is no surprise that the flow
distribution in Figure [11] shows this clearly: the largest
flows are tracked. The inversion algorithm in this paper
was designed to work with the packet based sample-and-hold
and it is no surprise that it does not perform
particularly well in this case. As discussed, negative values
occur in the predicted "probabilities" and the "distribution"
does not total to one as it should. This is a result
of applying an algorithm which is not quite appropriate
for the data set. Of course these issues could be
fixed by forcing a minimum of zero and normalizing the
distribution artificially. Nonetheless, the inverted
distribution is an improvement over the original at longer
flow lengths, although it performs poorly over short flows.
The distribution in Figure [11] is reconstructed from only
527 flows so it is, perhaps, impressive that it is as close
to the original as it is.
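
The artificial fix mentioned above (forcing a minimum of zero and renormalizing) could look like the following Python sketch. It was not applied to the results in this paper, and the function name is hypothetical.

```python
def clip_and_normalize(estimates):
    """Force estimates to be non-negative and rescale them to sum to one."""
    clipped = [max(0.0, e) for e in estimates]
    total = sum(clipped)
    if total == 0.0:
        raise ValueError("no positive estimates to normalize")
    return [c / total for c in clipped]
```

Clipping discards the negative mass rather than redistributing it to neighbouring lengths, so the result is a valid distribution but is biased wherever negative estimates occurred.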

The advantage of sample-and-hold by byte is that
it focuses clearly on those "elephant" flows which can
dominate traffic. It has also been previously studied in the
literature and implemented in software for real sampling
applications. On the other hand, a disadvantage is
that no good inversion algorithm exists as yet. In addition,
there is no obvious way to recover a packet sampled
data set from the sample-and-hold by byte data.

3.4 Sample-and-hold (by SYN) 

Sample-and-hold by SYN was run with $p$ tuned to get
approximately one packet in every one hundred. If the
assumption of one SYN packet per flow was correct this
would simply mean setting $p = 0.01$ but, as previously
discussed, this assumption is not met in the real data.
In this sample, 520,116 packets were sampled in 68,618
flows, a mean of 7.6 packets per flow.




[Figure 12 plot. x-axis: Length of flow in packets (k).]

Figure 12: Cumulative density function for SYN
based sample-and-hold with sampling of approximately
one packet in every one hundred on the
CAIDA data.

The sample-and-hold by SYN is actually surprisingly
good at recovering the flow distribution from the sample, as can be
seen in Figures [12] and [13]. However, this is somewhat
misleading. As can be seen in Figure [13], the distribution
from all SYN flows (effectively SYN sampling at a rate
of 1) is, in fact, very different from the distribution of
all the flows. This is not unexpected. It seems that in
this trace the non SYN flows (mostly UDP and some
ICMP) are shorter and, in particular, there are more
flows which are just one or two packets long.

[Figure 13 plot. Series: "All SYN flows (no sampling)", "SYN flows sampled 1 in 100", "All flows (no sampling)". x-axis: Length of flow in packets (k).]

Figure 13: Cumulative density function for SYN
based sample-and-hold with sampling of approximately
one packet in every one hundred on the
CAIDA data.

It has already been noted that SYN sampling does 
not provide an unbiased estimate of SYN flows because, 
in reality, flows can have more than one SYN packet. 
Many of these flows with more than one SYN packet are 
short flows (perhaps because a flow with multiple SYN 
packets can result from trying to initiate a connection 
to a machine which is not responding). When SYN 
sample-and-hold is used then it will be more likely to 
sample those flows with multiple SYN packets (unless 
the sample rate is one, of course). The presence of such 
a protocol behaviour has fortuitously cancelled out the 
error in the other direction and the good recovery of the 
flow distribution is a product of two errors in opposite 
directions cancelling rather than a true measure of the 
success of the algorithm. This gives a large element of 
uncertainty to the use of SYN sampling as a method for 
recovering flow distributions since the basic assumption 
(that TCP flows begin with a single SYN packet) is not 
met in real data. 
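
The bias towards flows with several SYN packets can be quantified with a one-line calculation. The Python sketch below assumes that each SYN packet of a flow independently triggers sampling with probability p; the function name is hypothetical.

```python
def syn_selection_probability(num_syn_packets, p):
    """Probability that SYN sample-and-hold starts following a flow,
    given how many SYN packets the flow contains."""
    return 1.0 - (1.0 - p) ** num_syn_packets
```

With p = 0.01, a flow containing three SYN packets is selected with probability of about 0.0297, almost three times the 0.01 of a flow with a single SYN, while a flow with no SYN packet is never selected.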

A major disadvantage of sample-and-hold by SYN 
packet is that it cannot provide information about non 
TCP packets. A major advantage is that it gives an 
approximate flow distribution with no need for inversion 
because it is an approximation to flow sampling. 

4. CONCLUSIONS AND FUTURE WORK 

Producing flow distributions for sampled packet traces
is a difficult problem. Several authors have approached
the problem of producing flow distributions from traces
sampled using standard packet sampling. However,
different sampling methods can be used to provide a sample
which makes the recovery of the flow distribution
easier while at the same time not putting an undue
requirement on memory and storage for the hardware
performing the sampling.

Of the sample-and-hold based methods studied here 
all have advantages and disadvantages. The byte based 
sample-and-hold originally proposed in [5] was intended 
to focus on the longest flows and does this better than 
any other method. An inversion to recover the original 
flow distribution has been attempted in this paper and 
is partially successful. Further work in this area might 
improve this algorithm. 

Packet based sample-and-hold has two advantages. 
Firstly, it can be inverted well to produce a reason- 
able estimate of the flow distribution even for relatively 
low sampling rates (approximately one in one hundred 
packets sampled). Secondly, it can be resampled to get
a packet sample and recover those quantities which can 
be measured at the packet, not flow, level. A disadvan- 
tage is that the estimated probabilities are not guaran- 
teed positive. 

SYN based sample-and-hold is close to the original
flow-based sampling proposed by Hohn and Veitch [3]
but has problems due to flows with more than one
SYN packet. It is possible that further work could correct
for this problem. However, the problem that this technique
will only ever be useful for TCP traffic remains a
major issue.

Our future work on this topic will focus on two main
issues. Firstly, there is more information to be gained
from other parts of the TCP header; notably, the authors
know of no results which use the FIN or RST flag
for inversion. While these are problematic since not every
flow terminates correctly, it would still seem that
valuable information is contained in these flags. Secondly,
using multiple sample sources could be a rich
topic for research. Some work has already been done in
this area: [TJ Section 6] provides a start on this topic 
focusing on packet sampling and [20] provides another
approach. Investigating different sampling techniques 
which might take advantage of network topology (for 
example, if samples are available from two directions on 
the same link) could provide more information which 
might be used to develop better sampling techniques 
and also to provide more information for the inversion 
problem. 